Unusual Deaths in Mexico

2017-04-05: VERY Preliminary!

Materials

This is a practice session for dplyr and ggplot2 using a case study related to the tidyr package.

The case investigates causes of death in Mexico that have unusual temporal patterns within a day. The individual-level mortality data for 2008 show the following pattern of deaths by hour:

Temporal pattern of all causes of death

Do you notice anything unusual or unexpected? The figure shows several peaks within a day, indicating increased risks of death at certain times of the day. What could be generating these patterns?

The author finds:

The causes of [unusual] death fall into three main groups: murder, drowning, and transportation related. Murder is more common at night, drowning in the afternoon, and transportation related deaths during commute times. The pale gray line in the background shows the temporal course across all diseases [i.e., causes] [@Wickham2014].

Causes of death with unusual temporal courses. Overall hourly death rate shown in grey. Causes of death with more than 350 deaths over a year. Note that the y-axes are on different scales.

There are two datasets: deaths, containing the timing of deaths and coded causes, and codes, containing the lookup table for the death codes.

The dataset deaths has over 500,000 records (rows), so we use head() to look at the first several rows.

# "deaths08b" is a renamed dataset with easier-to-read column names 
head(deaths08b) 
##   Year of Death (yod) Month of Death (mod) Day of Death (dod)
## 1                2008                    1                  1
## 2                2008                    1                  1
## 3                2008                    1                  1
## 4                2008                    1                  1
## 5                2008                    1                  1
## 6                2008                    1                  1
##   Hour of Death (hod) Cause of Death (cod)
## 1                   1                  B20
## 2                   1                  B22
## 3                   1                  C18
## 4                   1                  C34
## 5                   1                  C50
## 6                   1                  C50

The dataset codes has 1851 records. The table below is generated with the DT and webshot packages.

In the search box, you can type in key words like “bacteria”, “nutrition”, and “fever”, as well as “assault” and “exposure” to see what items are in the data.

We will reproduce this case study and get some practice with dplyr and ggplot2.

Arts & Crafts

Let’s recap key ingredients of dplyr and ggplot2 from the introduction in section @ref(intro).

The six important functions in dplyr are:

  • filter(): extracts rows (e.g. observations) of a data frame. We put logical vectors in its arguments.

  • select(): extracts columns (e.g. variables) of a data frame. We put column names in its arguments.

  • arrange(): orders rows of a data frame. We put column names in its arguments.

  • summarise(): collapses a data frame into summary statistics. We put summary functions (e.g. statistics functions) using column names in its arguments.

  • mutate(): creates new variables and adds them to the existing columns. We put window functions (e.g. transforming operations) using column names in its arguments.

  • group_by(): assigns rows into groups in a data frame. We put column names in its arguments.
    We use piping operator %>% (read as then) to translate a sentence of sequential instructions. For example, take dataset deaths08, %>% (then) group the data by month of death, and %>% (then) summarise the grouped data for the number of records/observations.

deaths08 %>% 
  group_by(mod) %>%   # mod: month of death
  summarise( nobs = n() )  # n(): a dplyr function to count rows 
## # A tibble: 12 × 2
##      mod  nobs
##    <int> <int>
## 1      1 49002
## 2      2 41685
## 3      3 44433
## 4      4 39845
## 5      5 41710
## 6      6 38592
## 7      7 40198
## 8      8 40297
## 9      9 39481
## 10    10 41671
## 11    11 43341
## 12    12 42265
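The remaining verbs work the same way. Here is a minimal sketch on a toy data frame (the column names g and x are made up for illustration, not from the case study):

```r
library(dplyr)

# Toy data frame with a grouping column g and a numeric column x
df <- data.frame(g = c("a", "a", "b"), x = c(3, 1, 2))

df %>% filter(x > 1)          # keep rows where x exceeds 1
df %>% select(x)              # keep only column x
df %>% arrange(desc(x))       # sort rows by x, descending
df %>% mutate(x2 = x * 2)     # add a transformed column
```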

ggplot2 graphics consist of three components:

  • data: a data frame e.g., the first argument in ggplot(data, ...).

  • aes: specifications for x-y variables, as well as variables to differentiate geom objects by color, shape, or size. e.g., aes(x = var_x, y = var_y, shape = var_z)

  • geom: geometric objects such as points, lines, bars, etc. e.g., geom_point(), geom_line(), geom_histogram()

We specify data and aes in ggplot() and then add geom objects with the + symbol (read as add a layer of); e.g., ggplot(data = dataset, mapping = aes(x = ...)) + geom_point(). The order of layers added with + is generally interchangeable.

Combined with the %>% operator, we can read the code as a sentence. For example: take dataset deaths08, %>% (then) plot it with ggplot() with the aesthetic of hour of day on the x-axis, + (and add a layer of) the geom object geom_histogram().

#  histogram version of the line-graph for the total number of deaths above
deaths08 %>% 
  ggplot(aes(x = hod)) + geom_histogram(binwidth = 1, color = "white") 
Histogram of deaths by hour of the day

# summary by month of death (mod) and hour of death (hod),
# e.g., Jan-1am, ..., Jan-11pm, Feb-1am, ..., Feb-11pm, ...
n_month_hour <- deaths08 %>%
  group_by(mod, hod) %>%
  summarise( nobs = n() )

n_month_hour %>%
  ggplot(aes(x = hod, y = nobs, color = as.factor(mod))) + 
  geom_point() 

# "last_plot() + " allows for adding more layers to the previous plot
last_plot() + geom_line()

Exercise

Now it is your turn. The exercise is to reproduce the figure above for the unusual causes of death.

  1. Download materials: case study paper and case study data

  2. Set working directory: setwd(your_directory)

  3. Load libraries: library(dplyr), library(ggplot2), library(MASS) (used for fitting data by robust regression)

  • Note 1: There is a minor error in the case study: the author accidentally kept several records from years other than 2008. This has virtually no effect on the results, and we will do the same to exactly reproduce his results.

  • Note 2: You can look at the code in the paper itself for hints. But note that the code there is written with functions from the plyr package, the predecessor of dplyr. Do not load both plyr and dplyr in the same R session; they do not work well together. Restart R if you accidentally load both.

Display overall hourly deaths

We will reproduce:

Hints:

  • Filter NA in the hour of day (hod) variable

  • Use group_by(), summarise(), n() to obtain death counts by group

  • Use ggplot() + geom_line() to produce plot

  • Use + labs( x = "x label", y = "y label") for axis labels

  • See the help file of scale_y_continuous() for comma-formatted axis labels (use ?function_name for help)
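Putting the hints above together, one possible sketch. A tiny stand-in for deaths08 is used here so the code runs on its own; the real data frame has an hour-of-death column hod (with some missing values):

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for deaths08 (illustrative values only)
deaths08 <- data.frame(hod = c(1, 1, 2, 3, 3, 3, NA))

hourly <- deaths08 %>%
  filter(!is.na(hod)) %>%        # drop records with missing hour
  group_by(hod) %>%
  summarise(freq = n())          # death count per hour

p <- hourly %>%
  ggplot(aes(x = hod, y = freq)) +
  geom_line() +
  labs(x = "Hour of day", y = "Number of deaths") +
  scale_y_continuous(labels = scales::comma)   # comma-formatted labels
```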

Count deaths per hour, per disease

We will reproduce:

Panel (a) of the table contains the frequency (i.e. the number of rows) for each combination of hour of day (hod) and cause of death (cod), supplemented by the disease description in panel (b). Panel (c) shows the proportion (prop) of each hod-cod combination in the total deaths by the cod. Panel (d) contains the frequency and proportion (freq_all and prop_all) of the death counts by hour of day.

That is, panel (a) is the raw counts (i.e., frequency) of observations for each pair of hour of day (hod) and cause of death (cod), and panel (b) makes it easy to look up the cause of death (cod). Panel (c) converts the frequency of each hod-cod pair into its proportion (i.e., relative frequency) of the total frequency for that cod, so that we can see how disproportionately large each hod-cod pair is within the particular cod. Panel (d), on the other hand, presents the frequency and relative frequency of deaths for each hour (hod); if each hour had the same probability of death, we would see prop_all \(\approx\) 0.042 (by 1/24). Here we see the author’s idea of identifying “unusual deaths”: compare the prop of each hod-cod pair for its deviation from prop_all.

Hints for creating panels (a) and (b)

  • Use more than one variable in group_by()

  • Use summarise() with n() to obtain death counts by group

  • Use left_join() to add the information from dataset codes

Hints for creating panel (c)

  • Use mutate() with sum() on the joined dataset

Hints for creating panels (d)

  • Create a new data frame by using summarise() on the joined and mutated data frame. (summarise() reduces the data frame to its summary, which is the basis of panel (d). Once the desired summary is created, we will merge it into the data frame of panels (a)-(c).)

  • Before using summarise() above, use group_by() to specify new grouping

  • First create freq_all via summarise() with n() and then create prop_all via mutate() with sum() (call this data frame overall_freq, which will be used again at the very end)

  • Use left_join() to join panels (a)-(c) and panel (d) (overall_freq), which we refer to as master_hod data frame.
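Putting the four panels together, a minimal self-contained sketch. Tiny stand-ins for deaths08 and codes are defined here so the code runs on its own; the real datasets have many more rows and columns:

```r
library(dplyr)

# Toy stand-ins for deaths08 and codes (illustrative values only)
deaths08 <- data.frame(hod = c(1, 1, 2, 1),
                       cod = c("B20", "B20", "B20", "C18"))
codes    <- data.frame(cod = c("B20", "C18"),
                       disease = c("HIV disease", "Colon cancer"))

# Panels (a) and (b): counts per hod-cod pair plus disease descriptions
hod_cod <- deaths08 %>%
  group_by(cod, hod) %>%
  summarise(freq = n()) %>%
  left_join(codes, by = "cod")

# Panel (c): each pair's share within its cause of death
hod_cod <- hod_cod %>%
  group_by(cod) %>%
  mutate(prop = freq / sum(freq))

# Panel (d): overall deaths per hour and their shares
overall_freq <- hod_cod %>%
  group_by(hod) %>%
  summarise(freq_all = sum(freq)) %>%
  mutate(prop_all = freq_all / sum(freq_all))

master_hod <- hod_cod %>%
  left_join(overall_freq, by = "hod")
```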

Hints for extracting the same rows as in Table 16 above

  • Create a subset of the master_hod data under a new name

  • Use filter() to select cod being either “I21”, “N18”, “E84”, or “B16” and hod being greater than or equal to 8 and less than or equal to 11

  • Use select() to pick columns in a desired order and arrange() to sort
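A sketch of the subsetting steps, with a toy master_hod standing in for the real one (which has more rows and columns):

```r
library(dplyr)

# Toy master_hod; only Z99 falls outside the requested codes and hours
master_hod <- data.frame(cod = c("I21", "B16", "Z99"),
                         hod = c(8, 11, 9),
                         freq = c(5, 3, 1))

sub_hod <- master_hod %>%
  filter(cod %in% c("I21", "N18", "E84", "B16"),
         hod >= 8, hod <= 11) %>%
  select(cod, hod, freq) %>%   # columns in the desired order
  arrange(cod, hod)            # sort rows
```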

Find outliers

We will reproduce:

We will create a deviation variable named dist by taking the mean of squared differences between prop and prop_all by cause of death (cod). The figures above show the number of observations n and this distance measure dist by cause of death on the raw scale (left) and on the log scale (right).

Hints

  • Use group_by() and summarise() on the master_hod data frame to generate n with function sum() and dist by mean((prop - prop_all)^2)

  • Filter this summary for n > 50 and call it devi_cod (deviations by cause of death)

  • Use ggplot() + geom_point() on devi_cod to produce the raw-scale figure (left)

  • Additionally use scale_x_log10(), scale_y_log10(), and geom_smooth(method = "rlm", se = FALSE) to produce the log-scale figure (right)

  • See help for scale_x_log10() to adjust axis labels (look for “comma”)

  • Let’s not worry about grids for the log-scale figure
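These steps can be sketched as follows, with a toy master_hod carrying the columns the hints assume (freq, prop, prop_all). With the real data, geom_smooth(method = MASS::rlm, se = FALSE) can be layered onto the log-scale figure; the toy data here has too few points for a meaningful fit:

```r
library(dplyr)
library(ggplot2)

# Toy master_hod with illustrative values
master_hod <- data.frame(
  cod      = rep(c("A01", "B02"), each = 2),
  hod      = c(1, 2, 1, 2),
  freq     = c(60, 40, 30, 70),
  prop     = c(0.6, 0.4, 0.3, 0.7),
  prop_all = c(0.5, 0.5, 0.5, 0.5)
)

devi_cod <- master_hod %>%
  group_by(cod) %>%
  summarise(n = sum(freq),
            dist = mean((prop - prop_all)^2)) %>%
  filter(n > 50)

# Log-scale version of the scatter plot
p <- devi_cod %>%
  ggplot(aes(x = n, y = dist)) +
  geom_point() +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10()
```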

Fit data by a regression and plot residuals

We reproduce:

The figure is a plot of the residuals resid from the regression of log(dist) on log(n), plotted against n on a log scale. By inspection, the points lying above the horizontal line at resid = 1.5 are considered “unusual causes of death” by the author.

Here the author used a robust linear model regression (rlm()), but its syntax is similar to the standard linear model regression (lm()).

Let’s look at a simple example of regression by lm().

df <- data.frame(
  x1 = c(1:10),   # use `=` (not `<-`) to name columns inside data.frame()
  y1 = c(1, 3, 2, 4, 6, 5, 7, 5, 7, 8)
)

df %>% 
  ggplot(aes(x = x1, y = y1)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)

#  The geom_smooth() is estimating the following linear regression:
#  y1 = intercept + coefficient * x1 + residual
#  which is  
f1 <- lm(formula = y1 ~ x1,  data = df) 
class(f1)    # class "lm"
## [1] "lm"
summary(f1)  # summary() knows how to summarise an object of class "lm" 
## 
## Call:
## lm(formula = y1 ~ x1, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52727 -0.57273 -0.02727  0.52273  1.54545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0000     0.6924   1.444 0.186656    
## x1            0.6909     0.1116   6.192 0.000262 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014 on 8 degrees of freedom
## Multiple R-squared:  0.8273, Adjusted R-squared:  0.8058 
## F-statistic: 38.34 on 1 and 8 DF,  p-value: 0.0002618
coefficients(f1)  # coefficient point estimate
## (Intercept)          x1 
##   1.0000000   0.6909091
vcov(f1)          # coefficient variance-covariance matrix
##             (Intercept)          x1
## (Intercept)  0.47939394 -0.06848485
## x1          -0.06848485  0.01245179
predict(f1)       # predicted (fitted) values with the estimated coefficients 
##        1        2        3        4        5        6        7        8 
## 1.690909 2.381818 3.072727 3.763636 4.454545 5.145455 5.836364 6.527273 
##        9       10 
## 7.218182 7.909091
resid(f1)         # residuals:  
##           1           2           3           4           5           6 
## -0.69090909  0.61818182 -1.07272727  0.23636364  1.54545455 -0.14545455 
##           7           8           9          10 
##  1.16363636 -1.52727273 -0.21818182  0.09090909

Run a regression with rlm() using formula = log(dist) ~ log(n) and store the residuals in the devi_cod data frame. Plot the residuals against n on a log scale. We use rlm() to exactly replicate the case study, but in most cases lm() suffices. To read more about linear regression, see the help file of lm() (type ?lm).

Hints

  • While you should not have missing values in this case, check the dataset devi_cod for missing values in dist and n before running a regression. Most regression functions, including lm() and rlm(), drop any row with missing values in the estimation model. This becomes an issue if we want to add a new column containing predicted values or residuals to the existing dataset: when rows containing missing values are dropped, the vector generated by predict() or resid() will be shorter than the number of rows of the original dataset, and when adding it as a new column you need to match which rows correspond to which predicted or residual values.

  • Use ggplot() + geom_point() structure for the plot

  • Add + scale_x_log10() and + geom_hline(yintercept = 1.5)
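A sketch of this step on simulated data (the real devi_cod comes from the outlier step above; the column names n and dist match the hints):

```r
library(dplyr)
library(ggplot2)
library(MASS)   # for rlm(); note that MASS masks dplyr::select()

# Simulated devi_cod standing in for the real one
set.seed(1)
devi_cod <- data.frame(n = round(10^runif(20, 2, 4)))
devi_cod$dist <- exp(-1 - 0.5 * log(devi_cod$n) + rnorm(20, sd = 0.3))

fit <- rlm(log(dist) ~ log(n), data = devi_cod)
devi_cod$resid <- resid(fit)   # same length as nrow(devi_cod): no NAs here

p <- devi_cod %>%
  ggplot(aes(x = n, y = resid)) +
  geom_point() +
  scale_x_log10() +
  geom_hline(yintercept = 1.5)
```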

Visualise unusual causes of death

We reproduce:

The first figure shows the unusual causes of death in devi_cod with a relatively large number of deaths (n > 350), and the second those with a relatively small number of deaths (n <= 350).

Using the cutoff value resid > 1.5, filter devi_cod and call the result the unusual data frame. Join the master_hod and unusual data frames. Then create two subsets with the conditions n > 350 and n <= 350 and plot the data.

Hints

  • Use the ggplot() + geom_line() structure with + facet_wrap(~ disease, ncol = 3)

  • To include the baseline proportions representing the average across all causes of death, add another geom_line(aes(...), data = overall_freq) layer with a local aes() argument and a data argument. The data argument can specify another data frame to be drawn on the same plot (assuming the axes have the same units); here we use overall_freq from the panel (d) portion of Table 16 above.

  • last_plot() %+% another_data_frame reproduces a plot of the same structure with a different data frame
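One possible sketch of these steps, with toy devi_cod and master_hod data frames (the cause codes and disease names here are illustrative). An inner_join() keeps only the unusual causes; joining in the other direction with left_join() works as well:

```r
library(dplyr)
library(ggplot2)

# Toy stand-ins; the real devi_cod and master_hod come from earlier steps
devi_cod <- data.frame(cod = c("X60", "W65", "A01"),
                       n = c(400, 300, 500),
                       resid = c(2.0, 1.8, 0.1))
master_hod <- data.frame(cod = rep(c("X60", "W65", "A01"), each = 2),
                         hod = rep(1:2, 3),
                         prop = c(0.6, 0.4, 0.3, 0.7, 0.5, 0.5),
                         disease = rep(c("Assault", "Drowning", "Other"),
                                       each = 2))

unusual <- devi_cod %>% filter(resid > 1.5)

plot_data <- master_hod %>%
  inner_join(unusual, by = "cod")   # keep only the unusual causes

p <- plot_data %>%
  filter(n > 350) %>%               # the large-n figure
  ggplot(aes(x = hod, y = prop)) +
  geom_line() +
  facet_wrap(~ disease, ncol = 3)

# last_plot() %+% filter(plot_data, n <= 350) would redraw the same
# plot structure with the small-n subset
```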

The Key

The solution will be presented at the workshop and later posted here.